Abstract

Multiclass prediction is the problem of clas- 
sifying an object into a relevant target class. 
We consider the problem of learning a mul- 
ticlass predictor that uses only few features, 
and in particular, the number of used features 
should increase sub-linearly with the number 
of possible classes. This implies that features 
should be shared by several classes. We de- 
scribe and analyze the ShareBoost algorithm 
for learning a multiclass predictor that uses 
few shared features. We prove that Share- 
Boost efficiently finds a predictor that uses 
few shared features (if such a predictor ex- 
ists) and that it has a small generalization er- 
ror. We also describe how to use ShareBoost 
for learning a non-linear predictor that has 
a fast evaluation time. In a series of experi- 
ments with natural data sets we demonstrate 
the benefits of ShareBoost and evaluate its 
success relative to other state-of-the-art
approaches.

1. Introduction 

Learning to classify an object into a relevant target 
class surfaces in many domains such as document cate- 
gorization, object recognition in computer vision, and 
web advertisement. In multiclass learning problems 
we use training examples to learn a classifier which 
will later be used for accurately classifying new ob- 
jects. Typically, the classifier first calculates several 
features from the input object and then classifies the 
object based on those features. In many cases, it is
important that the runtime of the learned classifier will
be small. In particular, this requires that the learned 
classifier will only rely on the value of few features. 

We start with predictors that are based on linear com- 
binations of features. Later, in Section 4, we show how 
our framework enables learning highly non-linear pre- 
dictors by embedding non-linearity in the construction 
of the features. Requiring the classifier to depend on 
few features is therefore equivalent to sparseness of the 
linear weights of features. In recent years, the problem 
of learning sparse vectors for linear classification or re- 
gression has been given significant attention. While, 
in general, finding the most accurate sparse predic- 
tor is known to be NP hard (Natarajan, 1995; Davis 
et al., 1997), two main approaches have been proposed 
for overcoming the hardness result. The first approach 
uses the $\ell_1$ norm as a surrogate for sparsity (e.g. the Lasso
algorithm (Tibshirani, 1996) and the compressed sens- 
ing literature (Candes & Tao, 2005; Donoho, 2006)). 
The second approach relies on forward greedy selection 
of features (e.g. Boosting (Freund & Schapire, 1999) in
the machine learning literature and orthogonal match- 
ing pursuit in the signal processing community (Tropp 
& Gilbert, 2007)). 

A popular model for multiclass predictors maintains a 
weight vector for each one of the classes. In such case, 
even if the weight vector associated with each class is 
sparse, the overall number of used features might grow 
with the number of classes. Since the number of classes 
can be rather large, and our goal is to learn a model 
with an overall small number of features, we would like 
that the weight vectors will share the features with 
non-zero weights as much as possible. Organizing the 
weight vectors of all classes as rows of a single matrix, 
this is equivalent to requiring sparsity of the columns 
of the matrix. 

In this paper we describe and analyze an efficient
algorithm for learning a multiclass predictor whose
corresponding matrix of weights has a small number of
non-zero columns. We formally prove that if there 
exists an accurate matrix with a number of non-zero 
columns that grows sub-linearly with the number of 
classes, then our algorithm will also learn such a ma- 
trix. We apply our algorithm to natural multiclass 
learning problems and demonstrate its advantages over 
previously proposed state-of-the-art methods. 

Our algorithm is a generalization of the forward greedy 
selection approach to sparsity in columns. An alter- 
native approach, which has recently been studied in 
(Quattoni et al., 2009; Duchi & Singer, 2009),
generalizes the $\ell_1$ norm based approach, and relies on
mixed-norms. We discuss the advantages of the greedy
approach over mixed-norms in Section 1.2.

1.1. Formal problem statement 

Let V be the set of objects we would like to classify. 
For example, V can be the set of gray scale images 
of a certain size. For each object $v \in V$, we have a
pool of d predefined features, each of which is a real
number in $[-1, 1]$. That is, we can represent each $v \in V$
as a vector of features $x \in [-1, 1]^d$. We note that
the mapping from v to x can be non-linear and that
d can be very large. For example, we can define x
so that each element $x_i$ corresponds to some patch,
$p \in \{\pm 1\}^{9 \times 9}$, and a threshold $\theta$, where $x_i$ equals 1 if
there is a patch of v whose inner product with p is
higher than $\theta$. We discuss some generic methods for
constructing features in Section 4. From this point 
onward we assume that x is given. 

The set of possible classes is denoted by $\mathcal{Y} = \{1, \ldots, k\}$.
Our goal is to learn a multiclass predictor,
which is a mapping from the features of an object
into $\mathcal{Y}$. We focus on the set of predictors parametrized
by matrices $W \in \mathbb{R}^{k,d}$ that take the following form:

$$h_W(x) = \arg\max_{y \in \mathcal{Y}} (Wx)_y . \qquad (1)$$

That is, the matrix W maps each d-dimensional feature
vector into a k-dimensional score vector, and the
actual prediction is the index of the maximal element
of the score vector. If the maximizer is not unique, we
break ties arbitrarily.

Recall that our goal is to find a matrix W with few
non-zero columns. We denote by $W_{\cdot,i}$ the i'th column
of W and use the notation

$$\|W\|_{\infty,0} = |\{i : \|W_{\cdot,i}\|_\infty > 0\}|$$

to denote the number of columns of W which are not
identically the zero vector. More generally, given a
matrix W and a pair of norms $\|\cdot\|_p$, $\|\cdot\|_r$ we denote
$\|W\|_{p,r} = \|(\|W_{\cdot,1}\|_p, \ldots, \|W_{\cdot,d}\|_p)\|_r$, that is, we apply
the p-norm on the columns of W and the r-norm on
the resulting d-dimensional vector.
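To make the notation concrete, the following is a minimal NumPy sketch of the predictor in eqn. (1) and of the mixed norms defined above. The function names (`predict`, `mixed_norm`, `column_sparsity`) are illustrative and not taken from the paper, classes are indexed from 0 rather than 1, and ties are broken toward the lowest index, which is one admissible arbitrary choice.

```python
import numpy as np

def predict(W, x):
    """Eqn. (1): index of the maximal score (ties go to the lowest index)."""
    return int(np.argmax(W @ x))

def mixed_norm(W, p, r):
    """||W||_{p,r}: p-norm of each column, then r-norm of the d column norms."""
    col_norms = np.linalg.norm(W, ord=p, axis=0)
    return float(np.linalg.norm(col_norms, ord=r))

def column_sparsity(W):
    """||W||_{inf,0}: number of columns that are not identically zero."""
    return int(np.count_nonzero(np.abs(W).max(axis=0)))

# Toy matrix with k = 2 classes and d = 3 features; the middle feature is unused.
W = np.array([[1.0, 0.0, -2.0],
              [0.5, 0.0,  3.0]])
x = np.array([1.0, -1.0, 0.5])
```

On this toy matrix only two columns are non-zero, so a predictor built from it never needs to compute the second feature.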

The 0-1 loss of a multiclass predictor $h_W$ on an example
(x, y) is defined as $1[h_W(x) \neq y]$. That is, the 0-1 loss
equals 1 if $h_W(x) \neq y$ and 0 otherwise. Since this loss
function is not convex with respect to W, we use a
surrogate convex loss function based on the following
easy to verify inequalities:

$$1[h_W(x) \neq y] \le 1[h_W(x) \neq y] - (Wx)_y + (Wx)_{h_W(x)}$$
$$\le \max_{y' \in \mathcal{Y}} \; 1[y' \neq y] - (Wx)_y + (Wx)_{y'} \qquad (2)$$
$$\le \ln \sum_{y' \in \mathcal{Y}} e^{1[y' \neq y] - (Wx)_y + (Wx)_{y'}} . \qquad (3)$$

We use the notation $\ell(W, (x, y))$ to denote the right-hand
side (eqn. (3)) of the above. The loss given
in eqn. (2) is the multi-class hinge loss (Crammer
& Singer, 2003) used in Support-Vector-Machines,
whereas $\ell(W, (x, y))$ is the result of performing a "soft-max"
operation: $\max_x f(x) \le (1/p) \ln \sum_x e^{p f(x)}$,
where equality holds for $p \to \infty$.
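The chain of inequalities can be checked numerically. The sketch below evaluates the three quantities for a single example; the function names are illustrative, classes are indexed from 0, and the log-sum-exp is computed in the usual stabilized form, which is an implementation detail not discussed in the text.

```python
import numpy as np

def zero_one_loss(W, x, y):
    """1[h_W(x) != y] with the argmax predictor of eqn. (1)."""
    return float(np.argmax(W @ x) != y)

def hinge_loss(W, x, y):
    """Right-hand side of eqn. (2): max over y' of 1[y' != y] - (Wx)_y + (Wx)_y'."""
    s = W @ x
    z = (np.arange(len(s)) != y).astype(float) - s[y] + s
    return float(z.max())

def logistic_loss(W, x, y):
    """Right-hand side of eqn. (3), via a numerically stable log-sum-exp."""
    s = W @ x
    z = (np.arange(len(s)) != y).astype(float) - s[y] + s
    zmax = z.max()
    return float(zmax + np.log(np.exp(z - zmax).sum()))
```

For any W, x and y the three values should satisfy `zero_one_loss <= hinge_loss <= logistic_loss`, mirroring eqns. (2) and (3).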

This logistic multiclass loss function $\ell(W, (x, y))$ has
several nice properties; see for example (Zhang,
2004). Besides being a convex upper-bound on the 0-1
loss, it is smooth. The reason we need the loss func- 
tion to be both convex and smooth is as follows. If a 
function is convex, then its first order approximation 
at any point gives us a lower bound on the function 
at any other point. When the function is also smooth, 
the first order approximation gives us both lower and 
upper bounds on the value of the function at any other 
point (see footnote 1). ShareBoost uses the gradient of the loss
function at the current solution (i.e. the first order approximation
of the loss) to make a greedy choice of which
column to update. To ensure that this greedy choice 
indeed yields a significant improvement we must know 
that the first order approximation is indeed close to 
the actual loss function, and for that we need both 
lower and upper bounds on the quality of the first or- 
der approximation. 

Footnote 1. Smoothness guarantees that $|f(x) - f(x') - \nabla f(x')(x - x')| \le \beta \|x - x'\|^2$
for some $\beta$ and all $x, x'$. Therefore one
can approximate $f(x)$ by $f(x') + \nabla f(x')(x - x')$ and the
approximation error is upper bounded by the difference
between $x, x'$.

Given a training set $S = (x_1, y_1), \ldots, (x_m, y_m)$, the
average training loss of a matrix W is
$L(W) = \frac{1}{m} \sum_{(x,y) \in S} \ell(W, (x, y))$. We aim at approximately
solving the problem

$$\min_{W \in \mathbb{R}^{k,d}} L(W) \quad \text{s.t.} \quad \|W\|_{\infty,0} \le s . \qquad (4)$$

That is, find the matrix W with minimal training loss 
among all matrices with column sparsity of at most s, 
where s is a user-defined parameter. Since $\ell(W, (x, y))$
is an upper bound on $1[h_W(x) \neq y]$, by minimizing
L(W) we also decrease the average 0-1 error of W
over the training set. In Section 5 we show that for
sparse models, a small training error is likely to yield 
a small error on unseen examples as well. 

Regrettably, the constraint $\|W\|_{\infty,0} \le s$ in eqn. (4)
is non-convex, and solving the optimization problem
in eqn. (4) is NP-hard (Natarajan, 1995; Davis et al.,
1997). To overcome the hardness result, the ShareBoost
algorithm follows the forward greedy selection
approach. The algorithm comes with formal generalization
and sparsity guarantees (described in Section 5)
that make ShareBoost an attractive multiclass
learning engine due to efficiency (both during training
and at test time) and accuracy.

1.2. Related Work 

The centrality of the multiclass learning problem has 
spurred the development of various approaches for 
tackling the task. Perhaps the most straightforward 
approach is a reduction from multiclass to binary, e.g. 
the one-vs-rest or all pairs constructions. The more 
direct approach we choose, in particular, the multiclass
predictors of the form given in eqn. (1), has been
extensively studied and has shown great success in practice;
see for example (Duda & Hart, 1973; Vapnik,
1998; Crammer & Singer, 2003). 

An alternative construction, abbreviated as the single- 
vector model, shares a single weight vector, for all the 
classes, paired with class-specific feature mappings. 
This construction is common in generalized additive 
models (Hastie & Tibshirani, 1995), multiclass ver- 
sions of boosting (Freund & Schapire, 1997; Schapire 
& Singer, 1999), and has been popularized lately due 
to its role in prediction with structured output where 
the number of classes is exponentially large (see e.g. 
(Taskar et al., 2003)). While this approach can yield 
predictors with a rather mild dependency of the re- 
quired features on k (see for example the analysis in 
(Zhang, 2004; Taskar et al., 2003; Fink et al., 2006)),
it relies on a-priori assumptions on the structure of $\mathcal{X}$
and $\mathcal{Y}$. In contrast, in this paper we tackle general
multiclass prediction problems, like object recognition
or document classification, where it is not straightforward
or even plausible how one would go about
constructing a-priori good class specific feature mappings,
and therefore the single-vector model is not adequate.

The class of predictors of the form given in eqn. (1) 
can be trained using Frobenius norm regularization (as 
done by multiclass SVM - see e.g. (Crammer & Singer, 
2003)) or using $\ell_1$ regularization over all the entries of
W. However, as pointed out in (Quattoni et al., 2009),
these regularizers might yield a matrix with many non-zero
columns, and hence, will lead to a predictor that
uses many features. 

The alternative approach, and the most relevant to 
our work, is the use of mix-norm regularizations like
$\|W\|_{\infty,1}$ or $\|W\|_{2,1}$ (Lanckriet et al., 2004; Turlach
et al., 2000; Argyriou et al., 2006; Bach, 2008; Quattoni
et al., 2009; Duchi & Singer, 2009; Huang & Zhang,
2010). For example, (Duchi & Singer, 2009) solves the
following problem:

$$\min_{W \in \mathbb{R}^{k,d}} L(W) + \lambda \|W\|_{\infty,1} , \qquad (5)$$

which can be viewed as a convex approximation of our 
objective (eqn. (4)). This is advantageous from an op- 
timization point of view, as one can find the global 
optimum of a convex problem, but it remains unclear 
how well the convex program approximates the orig- 
inal goal. For example, in Section 6 we show cases 
where mix-norm regularization does not yield sparse 
solutions while ShareBoost does yield a sparse solution.
Despite the fact that ShareBoost tackles a non-convex
program, and is thus limited to locally optimal
solutions, we prove in Theorem 2 that under mild conditions
ShareBoost is guaranteed to find an accurate
sparse solution whenever such a solution exists and 
that the generalization error is bounded as shown in 
Theorem 1. 

We note that several recent papers (e.g. (Huang & 
Zhang, 2010)) established exact recovery guarantees 
for mixed norms, which may seem to be stronger 
than our guarantee given in Theorem 2. However, 
the assumptions in (Huang & Zhang, 2010) are much 
stronger than the assumptions of Theorem 2. In 
particular, they have strong noise assumptions and 
a group RIP like assumption (Assumptions 4.1-4.3 in
their paper). In contrast, we impose no such restric- 
tions. We would like to stress that in many generic 
practical cases, the assumptions of (Huang & Zhang, 
2010) will not hold. For example, when using decision 
stumps, features will be highly correlated which will 
violate Assumption 4.3 of (Huang & Zhang, 2010). 

Another advantage of ShareBoost is that its only pa- 
rameter is the desired number of non-zero columns of 
W. Furthermore, obtaining the whole regularization
path of ShareBoost, that is, the curve of accuracy as a
function of sparsity, can be performed by a single run 
of ShareBoost, which is much easier than obtaining
the whole regularization path of the convex relaxation 
in eqn. (5). Last but not least, ShareBoost can work 
even when the initial number of features, d, is very 
large, as long as there is an efficient way to choose the 
next feature. For example, when the features are con- 
structed using decision stumps, d will be extremely 
large, but ShareBoost can still be implemented effi- 
ciently. In contrast, when d is extremely large mix- 
norm regularization techniques yield challenging opti- 
mization problems. 

As mentioned before, ShareBoost follows the forward 
greedy selection approach for tackling the hardness of 
solving eqn. (4). The greedy approach has been widely
studied in the context of learning sparse predictors 
for linear regression. However, in multiclass problems, 
one needs sparsity of groups of variables (columns of 
W). ShareBoost generalizes the fully corrective greedy
selection procedure given in (Shalev-Shwartz et al.,
2010) to the case of selection of groups of variables, 
and our analysis follows similar techniques. 

Obtaining group sparsity by greedy methods has been 
also recently studied in (Huang et al., 2009; Majumdar 
& Ward, 2009), and indeed, ShareBoost shares simi- 
larities with these works. We differ from (Huang et al., 
2009) in that our analysis does not impose strong assumptions
(e.g. group-RIP) and so ShareBoost applies
to a much wider array of applications. In addition, the
specific criterion for choosing the next feature is different.
In (Huang et al., 2009), a ratio between the difference
in objective and the difference in costs is used. In ShareBoost,
the $\ell_1$ norm of the gradient matrix is used.
For the multiclass problem with log loss, the criterion 
of ShareBoost is much easier to compute, especially 
in large scale problems. (Majumdar & Ward, 2009) 
suggested many other selection rules that are geared 
toward the squared loss, which is far from being an 
optimal loss function for multiclass problems. 

Another related method is the JointBoost algorithm 
(Torralba et al., 2007). While the original presentation 
in (Torralba et al., 2007) seems rather different than 
the type of predictors we describe in eqn. (1), it is pos- 
sible to show that JointBoost in fact learns a matrix W 
with additional constraints. In particular, the features 
x are assumed to be decision stumps and each column
is constrained to be $\alpha_i (1[1 \in C_i], \ldots, 1[k \in C_i])$,
where $\alpha_i \in \mathbb{R}$ and $C_i \subset \mathcal{Y}$. That is, the stump is shared
by all classes in the subset $C_i$. JointBoost chooses such
shared decision stumps in a greedy manner by applying
the GentleBoost algorithm on top of this presentation.
A major disadvantage of JointBoost is that in its
pure form, it should exhaustively search C among all
$2^k$ possible subsets of $\mathcal{Y}$. In practice, (Torralba et al.,
2007) relies on heuristics for finding C on each boost- 
ing step. In contrast, ShareBoost allows the columns 
of W to be any real numbers, thus allowing "soft" 
sharing between classes. Therefore, ShareBoost has 
the same (or even richer) expressive power compared
to JointBoost. Moreover, ShareBoost automatically 
identifies the relatedness between classes (correspond- 
ing to choosing the set C) without having to rely on 
exhaustive search. ShareBoost is also fully corrective, 
in the sense that it extracts all the information from 
the selected features before adding new ones. This 
leads to higher accuracy while using fewer features, as
shown in our experiments on image classification.
Lastly, ShareBoost comes with theoretical guarantees. 

Finally, we mention that feature sharing is merely one 
way for transferring information across classes (Thrun, 
1996) and several alternative ways have been proposed 
in the literature such as target embedding (Hsu et al., 
2010; Bengio et al., 2011), shared hidden structure (LeCun
et al., 1998; Amit et al., 2007), shared prototypes
(Quattoni et al., 2008), or sharing underlying metric 
(Xing et al., 2003). 

2. The ShareBoost Algorithm 

ShareBoost is a forward greedy selection approach for 
solving eqn. (4). Usually, in a greedy approach, we update
the weight of one feature at a time. Now, we will
update one column of W at a time (since the desired
sparsity is over columns). We will choose the column
that maximizes the $\ell_1$ norm of the corresponding column
of the gradient of the loss at W. Since W is a
matrix we have that $\nabla L(W)$ is a matrix of the partial
derivatives of L. Denote by $\nabla_r L(W)$ the r'th column
of $\nabla L(W)$, that is, the vector
$\left( \frac{\partial L(W)}{\partial W_{1,r}}, \ldots, \frac{\partial L(W)}{\partial W_{k,r}} \right)$. A
standard calculation shows that

$$\frac{\partial L(W)}{\partial W_{q,r}} = \frac{1}{m} \sum_{(x,y) \in S} \sum_{c \in \mathcal{Y}} \rho_c(x, y) \, x_r \, (1[q = c] - 1[q = y])$$

where

$$\rho_c(x, y) = \frac{e^{1[c \neq y] - (Wx)_y + (Wx)_c}}{\sum_{y' \in \mathcal{Y}} e^{1[y' \neq y] - (Wx)_y + (Wx)_{y'}}} . \qquad (6)$$

Note that $\sum_c \rho_c(x, y) = 1$ for all (x, y). Therefore, we
can rewrite,

$$\frac{\partial L(W)}{\partial W_{q,r}} = \frac{1}{m} \sum_{(x,y)} x_r \, (\rho_q(x, y) - 1[q = y]) .$$

Based on the above we have

$$\|\nabla_r L(W)\|_1 = \frac{1}{m} \sum_{q \in \mathcal{Y}} \left| \sum_{(x,y)} x_r \, (\rho_q(x, y) - 1[q = y]) \right| . \qquad (7)$$

Finally, after choosing the column for which 
||V r L(W)||i is maximized, we re-optimize all the 
columns of W which were selected so far. The re- 
sulting algorithm is given in Algorithm 1. 

Algorithm 1 ShareBoost
1: Initialize: $W = 0$ ; $I = \emptyset$
2: for t = 1, 2, ..., T do
3:   For each class c and example (x, y) define
     $\rho_c(x, y)$ as in eqn. (6)
4:   Choose feature r that maximizes the right-hand
     side of eqn. (7)
5:   $I \leftarrow I \cup \{r\}$
6:   Set $W \leftarrow \arg\min_W L(W)$ s.t. $W_{\cdot,i} = 0$ for all
     $i \notin I$
7: end for
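The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, classes are indexed from 0, and step 6 is approximated by a fixed number of plain gradient-descent steps on the selected columns rather than by Nesterov's accelerated method. The step size `1 / len(I)` is an assumption that relies on features lying in [-1, 1].

```python
import numpy as np

def _margins(W, X, y):
    """Z[j, c] = 1[c != y_j] - (W x_j)_{y_j} + (W x_j)_c for every example j."""
    S = X @ W.T                                        # (m, k) scores
    return 1.0 - np.eye(W.shape[0])[y] - S[np.arange(len(y)), y][:, None] + S

def loss(W, X, y):
    """Average logistic multiclass loss L(W), with a stable log-sum-exp."""
    Z = _margins(W, X, y)
    zmax = Z.max(axis=1)
    return float(np.mean(zmax + np.log(np.exp(Z - zmax[:, None]).sum(axis=1))))

def rho(W, X, y):
    """rho_c(x, y) of eqn. (6) for all examples, as an (m, k) array."""
    Z = _margins(W, X, y)
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def gradient(W, X, y):
    """(k, d) matrix with entries (1/m) sum x_r (rho_q(x, y) - 1[q = y])."""
    G = rho(W, X, y) - np.eye(W.shape[0])[y]           # (m, k)
    return G.T @ X / len(y)

def shareboost(X, y, k, T, inner_steps=500):
    m, d = X.shape
    W = np.zeros((k, d))
    I = []                                             # selected feature indices
    for _ in range(T):
        G = gradient(W, X, y)                          # step 3 (rho computed inside)
        r = int(np.argmax(np.abs(G).sum(axis=0)))      # step 4: eqn. (7)
        if r not in I:
            I.append(r)                                # step 5
        step = 1.0 / len(I)                            # assumed safe for |x_r| <= 1
        for _ in range(inner_steps):                   # step 6, approximately
            W[:, I] -= step * gradient(W, X, y)[:, I]
    return W, I
```

Calling `shareboost(X, y, k, T)` with an `(m, d)` feature array `X` and an integer label array `y` returns the weight matrix together with the list of selected feature indices; every column outside that list stays exactly zero.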



The runtime of ShareBoost is as follows. Steps 3-5
require O(mdk). Step 6 is a convex optimization
problem in tk variables and can be performed using
various methods. In our experiments, we used Nesterov's
accelerated gradient method (Nesterov & Nesterov,
2004) whose runtime is $O(mtk/\sqrt{\epsilon})$ for a smooth
objective, where $\epsilon$ is the desired accuracy. Therefore,
the overall runtime is $O(Tmdk + T^2 mk/\sqrt{\epsilon})$. It is interesting
to compare this runtime to the complexity
of minimizing the mixed-norm regularization objective
given in eqn. (5). Since the objective is no longer
smooth, the runtime of using Nesterov's accelerated
method would be $O(mdk/\epsilon)$ which can be much larger
than the runtime of ShareBoost when $d \gg T$.

3. Variants of ShareBoost 

We now describe several variants of ShareBoost. The 
analysis we present in Section 5 can be easily adapted 
for these variants as well. 

3.1. Modifying the Greedy Choice Rule 

ShareBoost chooses the feature r which maximizes the 
$\ell_1$ norm of the r-th column of the gradient matrix.
Our analysis shows that this choice leads to a sufficient 
decrease of the objective function. However, one can 
easily develop other ways for choosing a feature which 
may potentially lead to an even larger decrease of the 
objective. For example, we can choose a feature r 
that minimizes L(W) over matrices W with support of
$I \cup \{r\}$. This will lead to the maximal possible decrease
of the objective function at the current iteration. Of
course, the runtime of choosing r will now be much
larger. Some intermediate options are to choose r that
minimizes

$$\min_{\alpha \in \mathbb{R}} L(W + \alpha \nabla_r R(W) e_r^\dagger)$$

or to choose r that minimizes

$$\min_{w \in \mathbb{R}^k} L(W + w e_r^\dagger) ,$$

where $e_r^\dagger$ is the all-zero row vector except 1 in the r'th
position.

3.2. Selecting a Group of Features at a Time 

In some situations, features can be divided into groups 
where the runtime of calculating a single feature in 
each group is almost the same as the runtime of cal- 
culating all features in the group. In such cases, it 
makes sense to choose groups of features at each iteration
of ShareBoost. This can be easily done by simply
choosing the group of features J that maximizes
$\sum_{j \in J} \|\nabla_j L(W)\|_1$.

3.3. Adding Regularization 

Our analysis implies that when |S| is significantly
larger than O(Tk) then ShareBoost will not overfit.
When this is not the case, we can incorporate regularization
in the objective of ShareBoost in order
to prevent overfitting. One simple way is to add to
the objective function L(W) a Frobenius norm regularization
term of the form $\lambda \sum_{i,j} W_{i,j}^2$, where $\lambda$ is
a regularization parameter. It is easy to verify that
this is a smooth and convex function and therefore we
can easily adapt ShareBoost to deal with this regularized
objective. It is also possible to rely on other
norms such as the $\ell_1$ norm or the $\ell_\infty/\ell_1$ mixed-norm.
However, there is one technicality due to the fact
that these norms are not smooth. We can overcome
this problem by defining smooth approximations to
these norms. The main idea is to first note that for
a scalar a we have $|a| = \max\{a, -a\}$ and therefore
we can rewrite the aforementioned norms using max
and sum operations. Then, we can replace each max
expression with its soft-max counterpart and obtain
a smooth version of the overall norm function. For
example, a smooth version of the $\ell_\infty/\ell_1$ norm will
be

$$\|W\|_{\infty,1} \approx \frac{1}{\beta} \sum_{j=1}^{d} \log\left( \sum_{i=1}^{k} \left( e^{\beta W_{i,j}} + e^{-\beta W_{i,j}} \right) \right) ,$$

where $\beta \ge 1$ controls the tradeoff between quality of
approximation and smoothness.
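The quality of this soft-max approximation is easy to inspect numerically. The sketch below (function names are illustrative) computes the exact $\ell_\infty/\ell_1$ norm and its smooth surrogate. Since a log-sum-exp over 2k terms exceeds the maximum by at most log(2k), the surrogate overestimates the norm by at most $d \log(2k) / \beta$.

```python
import numpy as np

def norm_inf_1(W):
    """Exact ||W||_{inf,1}: sum over columns of the largest absolute entry."""
    return float(np.abs(W).max(axis=0).sum())

def smooth_norm_inf_1(W, beta):
    """(1/beta) * sum_j log(sum_i exp(beta W_ij) + exp(-beta W_ij)), computed stably."""
    A = beta * np.concatenate([W, -W], axis=0)        # (2k, d): both signs of each entry
    amax = A.max(axis=0)
    lse = amax + np.log(np.exp(A - amax).sum(axis=0))  # column-wise log-sum-exp
    return float(lse.sum() / beta)
```

Larger values of `beta` tighten the approximation at the cost of a less smooth function.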
Figure 1. Motivating super vectors.

4. Non-Linear Prediction Rules 

We now demonstrate how ShareBoost can be used 
for learning non-linear predictors. The main idea is 
similar to the approach taken by Boosting and SVM. 
That is, we construct a non-linear predictor by first 
mapping the original features into a higher dimen- 
sional space and then learning a linear predictor in 
that space, which corresponds to a non-linear predic- 
tor over the original feature space. To illustrate this 
idea we present two concrete mappings. The first is 
the decision stumps method which is widely used by 
Boosting algorithms. The second approach shows how 
to use ShareBoost for learning piece-wise linear predictors
and is inspired by the super-vectors construction
recently described in (Zhou et al., 2010). 

4.1. ShareBoost with Decision Stumps 

Let $v \in \mathbb{R}^p$ be the original feature vector representing
an object. A decision stump is a binary feature
of the form $1[v_i \le \theta]$, for some feature $i \in \{1, \ldots, p\}$
and threshold $\theta \in \mathbb{R}$. To construct a non-linear predictor
we can map each object v into a feature-vector x
that contains all possible decision stumps. Naturally, 
the dimensionality of x is very large (in fact, can even 
be infinite), and calculating Step 4 of ShareBoost may 
take forever. Luckily, a simple trick yields an efficient 
solution. First note that for each i, all stump features 
corresponding to i can get at most m + 1 values on a 
training set of size m. Therefore, if we sort the values 
of $v_i$ over the m examples in the training set, we can
calculate the value of the right-hand side of eqn. (7)
for all possible values of $\theta$ in total time of O(m). Thus,
ShareBoost can be implemented efficiently with deci- 
sion stumps. 
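The sorting trick can be sketched as follows for a single original feature. The function name and argument layout are hypothetical: `G` is assumed to hold the per-example terms $\rho_q(x, y) - 1[q = y]$ from eqn. (7), and the stump is assumed to have the form $1[v \le \theta]$. After sorting, one cumulative sum gives the inner sums of eqn. (7) for every candidate threshold at once.

```python
import numpy as np

def best_stump_threshold(v, G):
    """v: (m,) values of one original feature.
    G: (m, k) array with G[j, q] = rho_q(x_j, y_j) - 1[q = y_j].
    Returns (theta, score) maximizing eqn. (7) over stumps 1[v <= theta]."""
    order = np.argsort(v, kind="stable")
    v_sorted = v[order]
    # prefix[j] sums the rows of G over all examples with v <= v_sorted[j]
    # (exactly so at the last occurrence of each distinct value).
    prefix = np.cumsum(G[order], axis=0)
    scores = np.abs(prefix).sum(axis=1) / len(v)
    # Only the last occurrence of each distinct value is a valid threshold.
    last = np.r_[v_sorted[1:] != v_sorted[:-1], True]
    j = int(np.argmax(np.where(last, scores, -np.inf)))
    return float(v_sorted[j]), float(scores[j])
```

Repeating this for each of the p original features yields the best stump overall without ever materializing the full stump feature vector.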

4.2. Learning Piece-wise Linear Predictors 
with ShareBoost 

To motivate our next construction let us consider first 
a simple one dimensional function estimation problem. 
Given sample $(x_1, y_1), \ldots, (x_m, y_m)$ we would like to
find a function $f : \mathbb{R} \to \mathbb{R}$ such that $f(x_i) \approx y_i$ for
all i. The class of piece-wise linear functions can be
a good candidate for the approximation function f.
See for example an illustration in Fig. 1. In fact, it is
easy to verify that all smooth functions can be approximated
by piece-wise linear functions (see for example
the discussion in (Zhou et al., 2010)). In general, we
can express piece-wise linear vector-valued functions
as

$$f(v) = \sum_{j=1}^{q} 1[\|v - v_j\| < r_j] \, (\langle u_j, v \rangle + b_j) ,$$

where q is the number of pieces, $(u_j, b_j)$ represents the
linear function corresponding to piece j, and $(v_j, r_j)$
represents the center and radius of piece j. This expression
can be also written as a linear function over
a different domain, $f(v) = \langle w, \psi(v) \rangle$ where

$$\psi(v) = [\, 1[\|v - v_1\| < r_1] \, [v, 1] , \ldots , 1[\|v - v_q\| < r_q] \, [v, 1] \,] .$$
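A minimal sketch of the mapping $\psi$ is given below. The function name and the array layout (one row of `centers` per piece) are assumptions made for illustration.

```python
import numpy as np

def psi(v, centers, radii):
    """For each piece j, emit 1[||v - v_j|| < r_j] * [v, 1], then concatenate.
    v: (p,), centers: (q, p), radii: (q,). Returns a vector of length q * (p + 1)."""
    v1 = np.append(v, 1.0)                                  # [v, 1]
    active = np.linalg.norm(centers - v, axis=1) < radii    # which pieces contain v
    return (active[:, None] * v1).ravel()
```

With weights laid out as $w = [u_1, b_1, \ldots, u_q, b_q]$, the inner product of `w` with `psi(v, centers, radii)` reproduces the piece-wise linear form of f(v). Each block of p + 1 coordinates is a natural feature group for the variant of Section 3.2.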

In the case of learning a multiclass predictor, we shall
learn a predictor $v \mapsto W\psi(v)$, where W will be a
k by $\dim(\psi(v))$ matrix. ShareBoost can be used for
learning W. Furthermore, we can apply the variant of
ShareBoost described in Section 3.2 to learn a piece-wise
linear model with few pieces (that is, each group
of features will correspond to one piece of the model).
In practice, we first define a large set of candidate centers
by applying some clustering method to the training
examples, and second we define a set of possible
radii by taking values of quantiles from the training
examples. Then, we train ShareBoost so as to choose
a multiclass predictor that uses only few pairs $(v_j, r_j)$.

The advantage of using ShareBoost here is that while 
it learns a non-linear model it will try to find a model 
with few linear "pieces" , which is advantageous both 
in terms of test runtime as well as in terms of gener- 
alization performance. 

5. Analysis 

In this section we provide formal guarantees for the 
ShareBoost algorithm. The proofs are deferred to the 
appendix. We first show that if the algorithm has 
managed to find a matrix W with a small number of 
non-zero columns and a small training error, then the 
generalization error of W is also small. The bound 
below is in terms of the 0-1 loss. A related bound,
which is given in terms of the convex loss function, is 
described in (Zhang, 2004). 

Theorem 1 Suppose that the ShareBoost algorithm
runs for T iterations and let W be its output matrix.
Then, with probability of at least $1 - \delta$ over the choice
of the training set S we have that

$$\Pr_{(x,y) \sim \mathcal{D}}[h_W(x) \neq y] \le \Pr_{(x,y) \sim S}[h_W(x) \neq y] + O\left( \sqrt{\frac{Tk \log(Tk) \log(k) + T \log(d) + \log(1/\delta)}{|S|}} \right)$$



Next, we analyze the sparsity guarantees of ShareBoost.
As mentioned previously, exactly solving
eqn. (4) is known to be NP hard. The following main
theorem gives an interesting approximation guarantee.
It tells us that if there exists an accurate solution with
small $\ell_{\infty,1}$ norm, then the ShareBoost algorithm will
find a good sparse solution.

Theorem 2 Let $\epsilon > 0$ and let $W^\star$ be an arbitrary matrix.
Assume that we run the ShareBoost algorithm for
$T = \lceil 4 \frac{1}{\epsilon} \|W^\star\|_{\infty,1}^2 \rceil$ iterations and let W be the output
matrix. Then, $\|W\|_{\infty,0} \le T$ and $L(W) \le L(W^\star) + \epsilon$.



6. Feature Sharing: Illustrative Examples

In this section we present illustrative examples, show- 
ing that whenever strong feature sharing is possible 
then ShareBoost will find it, while competitive meth- 
ods might fail to produce solutions with a small num- 
ber of features. 

In the analysis of the examples below we use the fol- 
lowing simple corollary of Theorem 2. 

Corollary 1 Assume that there exists a matrix $W^\star$
such that $L(W^\star) \le \epsilon$, all entries of $W^\star$ are in $[-c, c]$,
and $\|W^\star\|_{\infty,0} = r$. Then, ShareBoost will find a matrix
W with $L(W) \le 2\epsilon$ and $\|W\|_{\infty,0} \le 4 r^2 c^2 / \epsilon$.
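One way to see how the corollary follows from Theorem 2 is sketched below. This is a reading of the argument rather than the proof given in the appendix, and it makes explicit the rounding that the corollary statement suppresses.

```latex
\begin{align*}
\|W^\star\|_{\infty,1} &= \sum_{j=1}^{d} \max_i |W^\star_{i,j}| \le r c
  && \text{(only $r$ non-zero columns, entries in $[-c, c]$)} \\
T &= \left\lceil \tfrac{4}{\epsilon} \|W^\star\|_{\infty,1}^2 \right\rceil
  \le \left\lceil \tfrac{4 r^2 c^2}{\epsilon} \right\rceil
  && \text{(number of iterations in Theorem 2)} \\
L(W) &\le L(W^\star) + \epsilon \le 2\epsilon ,
  \qquad \|W\|_{\infty,0} \le T
  && \text{(conclusion of Theorem 2 with $L(W^\star) \le \epsilon$)}
\end{align*}
```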

The first example we present shows an exponential gap
between the number of features required by ShareBoost
(as well as mixed norms) and the number of
features required by $\ell_2$ or $\ell_1$ regularization methods.
Consider a set of examples such that each example,
(x, y), is of the form $x = [\mathrm{bin}(y), 2\log(k) \, e_y] \in \mathbb{R}^{\log(k)+k}$,
where $\mathrm{bin}(y) \in \{\pm 1\}^{\log(k)}$ is the binary representation
of the number y in the alphabet $\{\pm 1\}$ and
$e_y$ is the vector which is zero everywhere except 1
in the y'th coordinate. For example, if k = 4 then
bin(1) = [-1,1], bin(2) = [1,-1], bin(3) = [1,1], and
bin(4) = [-1,-1].

Consider two matrices. The first matrix, denoted
$W^{(s)}$, is the matrix whose row y equals
$[\mathrm{bin}(y), (0, \ldots, 0)]$. The second matrix, denoted $W^{(f)}$,
is the matrix whose row y equals $[(0, \ldots, 0), e_y]$.
Clearly, the number of features used by $h_{W^{(s)}}$ is log(k)
while the number of features used by $h_{W^{(f)}}$ is k.
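The construction can be reproduced in a few lines. The sketch below assumes that k is a power of two, that logarithms are base 2, and that bin(y) encodes y modulo k with the most significant bit first, which matches the k = 4 example above. The function names are illustrative.

```python
import numpy as np

def bin_pm(y, k):
    """Binary representation of y (mod k) over {-1, +1}, most significant bit first."""
    n = k.bit_length() - 1                      # log2(k) for k a power of two
    return np.array([2.0 * (((y % k) >> (n - 1 - b)) & 1) - 1.0 for b in range(n)])

def build_example(k):
    """Returns one example per class plus the matrices W^(s) and W^(f)."""
    n = k.bit_length() - 1
    classes = np.arange(1, k + 1)
    B = np.stack([bin_pm(int(y), k) for y in classes])     # (k, log k)
    X = np.hstack([B, 2.0 * n * np.eye(k)])                # x = [bin(y), 2 log(k) e_y]
    W_s = np.hstack([B, np.zeros((k, k))])                 # shared: log(k) features
    W_f = np.hstack([np.zeros((k, n)), np.eye(k)])         # non-shared: k features
    return X, classes, W_s, W_f
```

Applying the predictor of eqn. (1) row by row, e.g. `np.argmax(X @ W_s.T, axis=1) + 1`, recovers every class label for both matrices, while the number of non-zero columns differs: log(k) for the shared matrix and k for the other.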



Observe that both $h_{W^{(f)}}(x)$ and $h_{W^{(s)}}(x)$ (see definition
in eqn. (1)) will make perfect predictions on
the training set. Furthermore, since for each example
(x, y), for each $r \neq y$ we have that
$(W^{(s)} x)_r \in [-\log(k), \log(k) - 2]$, for the logistic multiclass loss,
for any c > 0 we have that

$$L(cW^{(f)}) = \log(1 + (k-1) e^{1 - 2c\log(k)}) < L(cW^{(s)}) < \log(1 + (k-1) e^{1 - c(\log(k) - 2)}) .$$

It follows that for

$$c \ge \frac{1 + \log(k) - \log(e^{\epsilon} - 1)}{\log(k) - 2}$$

we have that $L(cW^{(s)}) < \epsilon$.

Consider an algorithm that solves the regularized
problem

$$\min_W L(W) + \lambda \|W\|_{p,p} ,$$

where p is either 1 or 2. In both cases, we have that
$\|W^{(f)}\|_{p,p} < \|W^{(s)}\|_{p,p}$ (see footnote 2). It follows that for any value
of $\lambda$, and for any c > 0, the value of the objective at
$cW^{(f)}$ is smaller than the value at $cW^{(s)}$. In fact, it
is not hard to show that the optimal solution takes
the form $cW^{(f)}$ for some c > 0. Therefore, no matter
what the regularization parameter $\lambda$ is, the solution of
the above regularized problem will use k features, even
though there exists a rather good solution that relies
on log(k) shared features.

In contrast, using Corollary 1 we know that if we stop
ShareBoost after poly(log(k)) iterations it will produce
a matrix that uses only poly(log(k)) features and
has a small loss. Similarly, it is possible to show that
for an appropriate regularization parameter, the mix-norm
regularization $\|W\|_{\infty,1}$ will also yield the matrix
$W^{(s)}$ rather than the matrix $W^{(f)}$.

In our second example we show that in some situations
using the mix-norm regularization,

$$\min_W L(W) + \lambda \|W\|_{\infty,1} ,$$

will also fail to produce a sparse solution, while ShareBoost
is still guaranteed to learn a sparse solution. Let
s be an integer and consider examples (x, y) where
each x is composed of s blocks, each of which is in
$\{\pm 1\}^{\log(k)}$. We consider two types of examples. In
the first type, each block of x equals bin(y). In
the second type, we generate an example as in the first
type, but then we zero one of the blocks (where we
choose uniformly at random which block to zero). As
before, $(1 - \epsilon)m$ examples are of the first type while
$\epsilon m$ examples are of the second type.

Footnote 2. $\|W^{(f)}\|_{p,p}^p = k$ whereas $\|W^{(s)}\|_{p,p}^p = k \log(k)$.

Consider again two matrices. The first matrix, 
denoted $W^{(s)}$, is the matrix whose row $y$ equals 
$[\mathrm{bin}(y), (0, \ldots, 0)]$. The second matrix, denoted 
$W^{(f)}$, is the matrix whose row $y$ equals 
$[\mathrm{bin}(y), \ldots, \mathrm{bin}(y)]/s$. Note that 
$\|W^{(f)}\|_{\infty,1} = \|W^{(s)}\|_{\infty,1}$. In addition, for any $(x, y)$ of the second 
type we have that $\mathbb{E}[W^{(s)} x] = W^{(f)} x$, where the expectation 
is with respect to the choice of which block to zero. 
Since the loss function is strictly convex, it follows 
from Jensen's inequality that $L(W^{(f)}) < L(W^{(s)})$. We 
have thus shown that using the $(\infty, 1)$ mixed-norm as a 
regularization will prefer the matrix $W^{(f)}$ over $W^{(s)}$. 
In fact, it is possible to show that the minimizer of 
$L(W) + \lambda \|W\|_{\infty,1}$ will be of the form $cW^{(f)}$ for some 
$c$. Since the number of blocks, $s$, was arbitrarily large, 
and since ShareBoost is guaranteed to learn a matrix 
with at most $\mathrm{poly}(\log(k))$ non-zero columns, we conclude 
that there can be a substantial gap between mixed-norm 
regularization and ShareBoost. The advantage 
of ShareBoost in this example follows from its ability 
to break ties (even in an arbitrary way). 
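The second example can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: $k = 8$ classes, blocks of $\log_2(k) = 3$ bits, $s = 4$ blocks, $\mathrm{bin}(y)$ taken to be the $\pm 1$ encoding of the binary representation of $y$, and the logistic multiclass loss $\log(1 + \sum_{y' \neq y} e^{1 - (Wx)_y + (Wx)_{y'}})$. All function and variable names are hypothetical.

```python
import numpy as np

def bin_pm1(y, bits):
    """Binary representation of y as a vector in {-1, +1}^bits (assumed encoding)."""
    return np.array([1.0 if (y >> b) & 1 else -1.0 for b in range(bits)])

def mixed_norm_inf_1(W):
    """(inf, 1) mixed norm: sum over columns (features) of the largest |entry|."""
    return float(np.abs(W).max(axis=0).sum())

def logistic_loss(scores, y):
    """log(1 + sum_{y' != y} exp(1 - scores[y] + scores[y']))."""
    margins = 1.0 - scores[y] + np.delete(scores, y)
    return float(np.log1p(np.exp(margins).sum()))

k, bits, s = 8, 3, 4
B = np.stack([bin_pm1(y, bits) for y in range(k)])       # k x bits, row y = bin(y)
W_s = np.hstack([B, np.zeros((k, (s - 1) * bits))])      # sparse: first block only
W_f = np.tile(B, (1, s)) / s                             # dense: average of all blocks

def second_type_examples(y):
    """The s variants of the class-y example in which one block is zeroed."""
    for zeroed in range(s):
        x = np.tile(bin_pm1(y, bits), s)
        x[zeroed * bits:(zeroed + 1) * bits] = 0.0
        yield x

y = 5
xs = list(second_type_examples(y))
mean_Ws_x = np.mean([W_s @ x for x in xs], axis=0)       # E[W^(s) x] over the zeroed block
loss_s = np.mean([logistic_loss(W_s @ x, y) for x in xs])
loss_f = np.mean([logistic_loss(W_f @ x, y) for x in xs])
```

On this instance both matrices have the same $(\infty, 1)$ norm, the averaged prediction of `W_s` coincides with the prediction of `W_f`, and `loss_f` is strictly smaller than `loss_s`, as the Jensen argument predicts.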

Naturally, the aforementioned examples are synthetic 
and capture extreme situations. However, in our ex- 
periments below we show that ShareBoost performs 
better than mixed-norm regularization on natural data 
sets as well. 

7. Experiments 

In this section we demonstrate the merits (and pitfalls) 
of ShareBoost by comparing it to alternative algo- 
rithms in different scenarios. The first experiment ex- 
emplifies the feature sharing property of ShareBoost. 
We perform experiments with an OCR data set and 
demonstrate a mild growth of the number of features 
as the number of classes grows from 2 to 36. The sec- 
ond experiment compares ShareBoost to mixed-norm 
regularization and to the JointBoost algorithm of (Tor- 
ralba et al., 2007). We follow the same experimental 
setup as in (Duchi & Singer, 2009). The main finding 
is that ShareBoost outperforms the mixed-norm regu- 
larization method when the output predictor needs to 
be very sparse, while mixed-norm regularization can 
be better in the regime of rather dense predictors. We 
also show that ShareBoost is both faster and more ac- 
curate than JointBoost. The third and final set of ex- 
periments is on the MNIST handwritten digit dataset 
where we demonstrate state-of-the-art accuracy at ex- 
tremely efficient runtime performance. 
Figure 2. The number of features required to achieve a 
fixed accuracy as a function of the number of classes for 
ShareBoost (dashed) and the 1-vs-rest (solid-circles). The 
blue lines are for a target error of 20% and the green lines 
are for 8%. 

7.1. Feature Sharing 

The main motivation for deriving the ShareBoost algo- 
rithm is the need for a multiclass predictor that uses 
only few features, and in particular, the number of 
features should increase slowly with the number of 
classes. To demonstrate this property of ShareBoost 
we experimented with the Char 74k data set which con- 
sists of images of digits and letters. We trained Share- 
Boost with the number of classes varying from 2 classes 
to the 36 classes corresponding to the 10 digits and 26 
capital letters. We calculated how many features were 
required to achieve a certain fixed accuracy as a function 
of the number of classes. The feature space is 
described in Section 7.4. 

We compared ShareBoost to the 1-vs-rest approach, 
where in the latter, we trained each binary classifier 
using the same mechanism as used by ShareBoost. 
Namely, we minimize the binary logistic loss using a 
greedy algorithm. Both methods aim at constructing 
sparse predictors using the same greedy approach. The 
difference between the methods is that ShareBoost se- 
lects features in a shared manner while the 1-vs-rest 
approach selects features for each binary problem sep- 
arately. In Fig. 2 we plot the overall number of features 
required by both methods to achieve a fixed accuracy 
on the test set as a function of the number of classes. 
As can be easily seen, the increase in the number of 
required features is mild for ShareBoost but significant 
for the 1-vs-rest approach. 

7.2. Comparing ShareBoost to Mixed-Norms 
Regularization 

Our next experiment compares ShareBoost to the use 
of mixed-norm regularization (see eqn. (5)) as a surrogate 
for the non-convex sparsity constraint. See Section 1.2 
for a description of the approach. To make the 
comparison fair, we followed the same experimental 
setup as in (Duchi & Singer, 2009) (using code pro- 
vided by ). 

We calculated the whole regularization path for the 
mixed-norm regularization by running the algorithm 
of (Duchi & Singer, 2009) with many values of the 
regularization parameter $\lambda$. In Fig. 3 we plot the results 
on three UCI datasets: StatLog, Pendigits and 
Isolet. The numbers of classes for these datasets are 
7, 10, and 26, respectively. The original dimensionality of 
these datasets is not very high and therefore, following 
(Duchi & Singer, 2009), we expanded the features 
by taking all products over ordered pairs of features. 
After this transformation, the numbers of features were 
630, 120, and 190036, respectively. 
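The pairwise-product expansion can be sketched as follows. This is a minimal illustration, not the code of (Duchi & Singer, 2009): the function name is hypothetical, and "ordered pairs" is read here as pairs of distinct features $i < j$, which is an assumption. Under that reading the reported counts 630, 120 and 190036 equal $\binom{n}{2}$ for $n = 36$, $16$ and $617$; these raw dimensions are inferred from the counts and are not stated in the text.

```python
from itertools import combinations
from math import comb

import numpy as np

def pairwise_products(X):
    """Map each row x to the products x[i] * x[j] over all feature pairs i < j."""
    idx_i, idx_j = zip(*combinations(range(X.shape[1]), 2))
    return X[:, list(idx_i)] * X[:, list(idx_j)]

# Toy data with 16 raw features (illustrative only).
X = np.random.default_rng(0).normal(size=(5, 16))
Z = pairwise_products(X)

# Number of expanded features for the inferred raw dimensions.
expanded_sizes = [comb(n, 2) for n in (36, 16, 617)]
```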

Fig. 3 displays the results. As can be seen, Share- 
Boost decreases the error much faster than the mixed- 
norm regularization, and therefore is preferable when 
the goal is to have a rather sparse solution. When more 
features are allowed, ShareBoost starts to overfit. This 
is not surprising since here sparsity is our only means 
for controlling the complexity of the learned classifier. 
To prevent this overfitting effect, one can use the variant 
of ShareBoost that incorporates regularization 
(see Section 3). 

7.3. Comparing ShareBoost to JointBoost 

Here we compare ShareBoost to the JointBoost algorithm 
of (Torralba et al., 2007). See Section 1.2 for 
a description of JointBoost. As in the previous experiment, 
we followed the experimental setup of (Duchi 
& Singer, 2009) and ran JointBoost of (Torralba et al., 
2007) using their published code with an additional 
implementation of the BFS heuristic for pruning the $2^k$ 
space of all class-subsets as described in their paper. 

Fig. 3 (bottom) displays the results. Here we used 
stump features for both algorithms since these are 
needed for JointBoost. As can be seen, ShareBoost 
decreases the error much faster and therefore is prefer- 
able when the goal is to have a rather sparse solution. 
As in the previous experiment we observe that when 
more features are allowed, ShareBoost starts to over- 
fit. Again, this is not surprising and can be prevented 
by adding additional regularization. The training run- 
time of ShareBoost is also much shorter than that of 
JointBoost (see discussion in Section 1.2). 



7.4. MNIST Handwritten Digits Dataset 

The goal of this experiment is to show that ShareBoost 
achieves state-of-the-art performance while construct- 
ing very fast predictors. We experimented with the 
MNIST digit dataset, which consists of a training set of 
60,000 digits represented by centered size-normalized 
28 x 28 images, and a test set of 10,000 digits (see 
Fig. 6 for some examples). The MNIST dataset has 
been extensively studied and is considered the standard 
test for multiclass classification of handwritten 
digits. The error rates achieved by the most advanced 
algorithms are below 1% of the test set (i.e., below 100 
classification mistakes on the test set). To get a sense 
of the challenge involved with the MNIST dataset, consider 
a straightforward 3-Nearest-Neighbor (3NN) approach 
where each test example $x$, represented as a 
vector with $28^2$ entries, is matched against every 
training example $x_j$ using the distance $d(x, x_j) = \|x - x_j\|^2$. 
The classification decision is then the majority class label 
of the three nearest training examples. This 
naive 3NN approach achieves an error rate of 2.67% 
(i.e., 267 mis-classification errors) with a run-time of 
unwieldy proportions. Going from 3NN to qNN with 
q = 4, 12 does not produce a better error rate. 
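The naive qNN baseline described above can be sketched in a few lines. This is an illustrative brute-force version, not the implementation used in the experiments; the function name is hypothetical, and ties in the vote are broken in favor of the label encountered first among the nearest neighbors, which is an assumption. Each test image is compared with all 60,000 training images of $28^2 = 784$ pixels, i.e. roughly $4.7 \cdot 10^7$ multiply-accumulate operations per test example, which is why the run-time is impractical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_test, q=3):
    """Label each test row by majority vote among its q nearest training rows."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )
    nearest = np.argsort(d2, axis=1)[:, :q]          # indices of the q closest rows
    preds = []
    for row in nearest:
        votes = Counter(y_train[row].tolist())       # insertion order = nearest first
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)

# Toy usage on two well-separated clusters (stand-in for flattened digit images).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.2, 0.1], [5.5, 5.2]])
predictions = knn_predict(X_train, y_train, X_test, q=3)
```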

More advanced shape-similarity measures could im- 
prove the performance of the naive qNN approach 
but at a heavier run-time cost. For example, the 
Shape Context similarity measure introduced by (Belongie 
et al., 2002) uses a bipartite matching algorithm 
between descriptors computed along 100 points 
in each image. A 3NN using Shape-Context similarity 
achieves an error rate of 0.63% but at a very high 
(practically unwieldy) run-time cost. The challenge 
with the MNIST dataset is, therefore, to design a multiclass 
algorithm with a small error rate (say below 1%) 
and efficient run-time performance. 

The top MNIST performer (Ciresan et al., 2010) uses 
a feed-forward Neural-Net with 7.6 million connec- 
tions which roughly translates to 7.6 million multiply- 
accumulate (MAC) operations at run-time as well. 
During training, geometrically distorted versions of the 
original examples were generated in order to expand 
the training set following (Simard et al., 2003) who in- 
troduced a warping scheme for that purpose. The top 
performance error rate stands at 0.35% at a run-time 
cost of 7.6 million MAC per test example. 

Table 1 summarizes the discussion so far, including 
the performance of ShareBoost. The error rate of 
ShareBoost with 266 rounds stands at 0.71% using 
the original training set and at 0.47% with the expanded 
training set of 360,000 examples, generated by adding 
five deformed instances per original example, and with 
T = 305 rounds. The run-time on test examples is 
around 40% of that of the leading MNIST performer. The 
error rate of 0.47% is better than that reported by (Decoste 
& Bernhard, 2002), who used a 1-vs-all SVM with 
a 9-degree polynomial kernel and with an expanded 
training set of 780,000 examples. The number of support 
vectors (accumulated over the ten separate binary 
classifiers) was 163,410, giving rise to a run-time 
21-fold that of ShareBoost. We describe below 
the details of the ShareBoost implementation on the 
MNIST dataset. 

Figure 3. ShareBoost compared with mixed-norm regularization (top) and JointBoost (bottom) on several UCI datasets. 
The horizontal axis is the feature sparsity (fraction of features used) and the vertical axis is the test error rate. 

Reference                        Error rate   Errors   Year   Run time 
3NN                              2.7%         270             x 14 
Shape Context (Belongie-et-al)   0.63%        63       2002   x 1000's 
SVM 9-poly (DeCosta-et-al)       0.56%        56       2002   x 38 
Neural Net (Ciresan-et-al)       0.35%        35       2010   x 2.5 
ShareBoost                       0.47%        47       2011   1 

Table 1. Comparison of ShareBoost and relevant methods on error rate and computational complexity over the MNIST 
dataset. More details in the text. 

The feature space we designed consists of 7x7 image 
patches with corresponding spatial masks, constructed 
as follows. All 7x7 patches were collected from all 
images and clustered using K-means to produce 1000 
centers $w_f$. With each such center (patch) we also associated 
a set of 16 possible masks $g_f^c$ in order to limit 
the spatial locations of the maximal response of the 
7x7 patch. The pairs $F = \{(w_f, g_f^c)\}$ form the pool of 
$d = 16{,}000$ templates (shape plus location). The vector 
of feature measurements $x = (\ldots, x_{fc}, \ldots) \in \mathbb{R}^d$ 
has each of its entries associated with one of the templates, 
where an entry is $x_{fc} = \max\{(I \otimes w_f) \times g_f^c\}$. 
That is, a feature is the maximal response of the convolution 
of the template $w_f$ over the image, weighted 
by the Gaussian $g_f^c$. 

ShareBoost selects a subset of the templates $j_1, \ldots, j_T$, 
where each $j_i$ represents some template pair $(w_{f_i}, g_{f_i}^{c_i})$, 
and the matrix $W \in \mathbb{R}^{k \times T}$. A test image $I$ is then converted 
to $\tilde{x} \in \mathbb{R}^T$ using $\tilde{x}_i = \max\{(I \otimes w_{f_i}) \times g_{f_i}^{c_i}\}$, with 
the maximum going over the image locations. The prediction 
$\hat{y}$ is then $\arg\max_{y \in [k]} (W\tilde{x})_y$. Fig. 5(a) shows 
the first 30 templates that were chosen by ShareBoost 
and their corresponding spatial masks. For example, 
the first template matches a digit part along the top 
of the image, the eleventh template matches a horizontal 
stroke near the top of the image, and so forth. 
Fig. 5(b) shows the weights (columns of $W$) of the 
first 30 templates of the model that produced the best 
results. For example, the eleventh template, which encodes 
a horizontal line close to the top, is expected in 
the digit "9" but not in the digit "4". Fig. 6 shows 
the 47 misclassified samples after T = 305 rounds of 
ShareBoost, and Fig. 4 displays the convergence curve 
of error-rate as a function of the number of rounds. 
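The feature computation and prediction rule described above can be sketched as follows. This is a rough illustration under several assumptions that are not in the text: the function names, the exact Gaussian parametrisation of the mask, the use of "valid" convolution (so a 28 x 28 image and a 7 x 7 template give a 22 x 22 response map), and the random stand-in data are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def template_feature(image, template, mask):
    """Max over image locations of the convolution response weighted by a spatial mask."""
    windows = sliding_window_view(image, template.shape)      # (H-6, W-6, 7, 7)
    flipped = template[::-1, ::-1]                            # flip => true convolution
    response = np.einsum("ijkl,kl->ij", windows, flipped)
    return float(np.max(response * mask))

def gaussian_mask(shape, center, sigma):
    """Gaussian bump over response locations (hypothetical mask parametrisation)."""
    rows, cols = np.indices(shape)
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def predict(image, templates, masks, W):
    """Build x from the T selected (template, mask) pairs and return argmax_y (W x)_y."""
    x = np.array([template_feature(image, t, g) for t, g in zip(templates, masks)])
    return int(np.argmax(W @ x))

# Toy usage with random stand-ins for the learned quantities.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
T, k = 5, 10
templates = [rng.normal(size=(7, 7)) for _ in range(T)]
masks = [gaussian_mask((22, 22), center=(3 * i, 11), sigma=4.0) for i in range(T)]
W = rng.normal(size=(k, T))
label = predict(image, templates, masks, W)
```

At test time only the T selected templates are evaluated, which is what keeps the predictor fast.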

In terms of run-time on a test image, the system requires 
305 convolutions of 7 x 7 templates and 540 
dot-product operations, which totals roughly $3.3 \cdot 10^6$ 
MAC operations, compared to around $7.5 \cdot 10^6$ MAC 
operations for the top MNIST performer. Moreover, 
due to the fast convergence of ShareBoost, 75 rounds 
are enough for achieving less than 1% error. Further 
improvements of ShareBoost on the MNIST dataset 
are possible, such as by further extending the training 
set using more deformations and by increasing the 
pool of features with other types of descriptors, but 
those were not pursued here. The point we wish 
to make is that ShareBoost can achieve performance 
competitive with the top MNIST performers, both in 
accuracy and in run-time, with little effort in feature 
space design while exhibiting great efficiency during 
training time as well. 

Figure 4. Convergence of ShareBoost on the MNIST 
dataset as it reaches 47 errors. The set was expanded with 
5 deformed versions of each input, using the method in 
(Simard et al., 2003). Since the deformations are fairly 
strong, the training error is higher than the test error. 
Zoomed-in version shown on the right. 

7.5. Comparing ShareBoost to kernel-based 
SVM 

In the experiments on the MNIST data set reported 
above, each feature is the maximal response of the con- 
volution of a 7 x 7 patch over the image, weighted by 
a spatial mask. 

One might wonder whether the strong performance of ShareBoost 
is due to the patch-based features we 
designed. In this section we address this doubt by using 
ShareBoost for training a piece-wise linear predictor, 
as described in Section 4.2, on MNIST using generic 
features. We show that ShareBoost comes close to 
the error rate of SVM with Gaussian kernels, while 
only requiring 230 anchor points, which is well below 
the number of support vectors needed by kernel-SVM. 
This underscores the point that ShareBoost can find 
an extremely fast predictor without sacrificing state-of-the-art 
performance level. 

Figure 5. (a) The first 30 selected features for the MNIST dataset. Each feature is composed of a 7x7 template and a 
position mask. (b) The corresponding columns of $W$. The entries of a column represent the pattern of "sharing" among 
classes. For example, the eleventh template, which encodes a horizontal line close to the top, is expected in the digits 
"9,8,5" but not in digit "4". 

Figure 6. ShareBoost achieves an error of 0.47% on the test 
set, which translates to 47 mistakes displayed above. Each 
misclassified test example is displayed together with its predicted 
and true labels. 

Figure 7. Test accuracy of ShareBoost on the MNIST 
dataset as a function of the number of rounds using the 
generic piece-wise linear construction. Blue: train accuracy. 
Red: test accuracy. Dashed: SVM with Gaussian 
kernel accuracy. 

Recall that the piece-wise linear predictor is of the 
following form: 

$$h(x) = \arg\max_{y} \Big( \sum_{j} 1\big[\|x - v^{(j)}\| < r^{(j)}\big] \big( W^{(j)}_{y,\cdot}\, x + b^{(j)}_y \big) \Big) ,$$

where $v^{(j)} \in \mathbb{R}^d$ are anchor points with radius of 
influence $r^{(j)}$, and $W^{(j)}, b^{(j)}$ together define a linear 
classifier for the $j$'th anchor. ShareBoost selects the 
set of anchor points and their radii together with 
the corresponding linear classifiers. In this context 
it is worthwhile to compare classification performance 
to SVM with Gaussian kernels applied in a 1-vs-all 
framework. Kernel-SVM also selects a subset of the 
training set $S$ with corresponding weight coefficients, 
thus from a mechanistic point of view our piece-wise 
linear predictor shares the same principles as kernel-SVM. 
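The piece-wise linear prediction rule recalled above can be written compactly. The sketch below is illustrative only: the function name, the array layout (anchors of shape (q, d), radii of shape (q,), per-anchor weights of shape (q, k, d) and biases of shape (q, k)) and the toy values are assumptions. When no anchor is within its radius all scores are zero and `argmax` returns class 0, a convention chosen here for simplicity.

```python
import numpy as np

def piecewise_linear_predict(x, anchors, radii, Ws, bs):
    """argmax_y sum_j 1[||x - v_j|| < r_j] * (W_j[y] . x + b_j[y])."""
    active = (np.linalg.norm(anchors - x, axis=1) < radii).astype(float)   # (q,)
    scores = np.einsum("j,jyd,d->y", active, Ws, x) + active @ bs          # (k,)
    return int(np.argmax(scores))

# Toy usage: two anchors in the plane, two classes.
anchors = np.array([[0.0, 0.0], [10.0, 0.0]])
radii = np.array([1.0, 1.0])
Ws = np.array([
    [[1.0, 0.0], [-1.0, 0.0]],     # linear classifier attached to anchor 0
    [[-1.0, 0.0], [1.0, 0.0]],     # linear classifier attached to anchor 1
])
bs = np.zeros((2, 2))
near_first = piecewise_linear_predict(np.array([0.5, 0.0]), anchors, radii, Ws, bs)
near_second = piecewise_linear_predict(np.array([10.5, 0.0]), anchors, radii, Ws, bs)
```

Only anchors whose ball contains $x$ contribute, so evaluation cost scales with the number of selected anchors rather than with the size of the training set.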

We performed a standard dimensionality reduction using 
PCA from the original raw pixel dimension of $28^2$ 
to 50, i.e., every digit was mapped to $x \in \mathbb{R}^{50}$ using 
PCA. The pool of anchor points was taken from 
a reduced training set by means of clustering $S$ into 
1500 clusters, and the range of radius values per anchor 
point was taken from a discrete set of 35 values. 
Taken together, each round of ShareBoost selected an 
anchor point $v^{(j)}$ and radius $r^{(j)}$ from a search space of 
size 52500. Fig. 7 shows the error rate as a function of 
the number of ShareBoost rounds. As can be seen, ShareBoost 
comes close to the error rate of SVM while only requiring 
230 anchor points, which is well below the number of 
support vectors needed by kernel-SVM. This underscores the 
point that ShareBoost can find an extremely fast predictor 
without sacrificing state-of-the-art performance 
level. 
A. Proofs 

A.1. Proof of Theorem 1 

The proof is based on an analysis of the Natarajan dimension of the class of matrices with a small number of 
non-zero columns. The Natarajan dimension is a generalization of the VC dimension for classes of multiclass 
hypotheses. In particular, we rely on the analysis given in Theorem 25 and Equation 6 of (Daniely et al., 2011). 
This implies that if the set of $T$ columns of $W$ is chosen in advance then 

$$\mathbb{P}_{(x,y)\sim D}[h_W(x) \neq y] \le \mathbb{P}_{(x,y)\sim S}[h_W(x) \neq y] + O\left(\sqrt{Tk\log(Tk)\log(k) + \log(1/\delta)} \,\big/ \sqrt{|S|}\right) .$$

Applying the union bound over all $\binom{d}{T}$ options to choose the relevant features we conclude our proof. 

A.2. Proof of Theorem 2 

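To prove the theorem, we start by establishing a certain smoothness property of $L$. First, we need the following. 

Lemma 1 Let $\ell : \mathbb{R}^k \to \mathbb{R}$ be defined as 

$$\ell(v) = \log\Big(1 + \sum_{i \in [k] \setminus \{j\}} e^{1 - v_j + v_i}\Big) .$$

Then, for any $u, v$ we have 

$$\ell(u + v) \le \ell(u) + \langle \nabla \ell(u), v \rangle + \|v\|_\infty^2 .$$

Proof Using Taylor's theorem, it suffices to show that the Hessian $H$ of $\ell$ at any point satisfies 

$$v^\top H v \le 2\|v\|_\infty^2 .$$

Consider some vector $w$ and without loss of generality assume that $j = 1$. We have 

$$\frac{\partial \ell(w)}{\partial w_1} = -\frac{\sum_{i=2}^k e^{1 - w_1 + w_i}}{1 + \sum_{p=2}^k e^{1 - w_1 + w_p}} \stackrel{\mathrm{def}}{=} \alpha_1$$

and for $i \ge 2$ 

$$\frac{\partial \ell(w)}{\partial w_i} = \frac{e^{1 - w_1 + w_i}}{1 + \sum_{p=2}^k e^{1 - w_1 + w_p}} \stackrel{\mathrm{def}}{=} \alpha_i .$$

Note that $-\alpha_1 = \sum_{i=2}^k \alpha_i \le 1$, and that for $i \ge 2$, $\alpha_i \ge 0$. Let $H$ be the Hessian of $\ell$ at $w$. It follows that for 
$i \ge 2$, 

$$H_{i,i} = \frac{e^{1 - w_1 + w_i}}{1 + \sum_{p=2}^k e^{1 - w_1 + w_p}} - \frac{\left(e^{1 - w_1 + w_i}\right)^2}{\left(1 + \sum_{p=2}^k e^{1 - w_1 + w_p}\right)^2} = \alpha_i - \alpha_i^2 .$$

In addition, for $j \ne i$ where both $j$ and $i$ are not 1 we have 

$$H_{i,j} = -\frac{e^{1 - w_1 + w_i}\, e^{1 - w_1 + w_j}}{\left(1 + \sum_{p=2}^k e^{1 - w_1 + w_p}\right)^2} = -\alpha_i \alpha_j .$$

For $i = 1$ we have 

$$H_{1,1} = -\alpha_1 - \alpha_1^2 ,$$

and for $i > 1$ 

$$H_{i,1} = -\alpha_i - \alpha_1 \alpha_i .$$

We can therefore rewrite $H$ as 

$$H = -\alpha\alpha^\top + \mathrm{diag}([-\alpha_1, \alpha_2, \ldots, \alpha_k]) - e^1 [0, \alpha_2, \ldots, \alpha_k] - [0, \alpha_2, \ldots, \alpha_k]^\top (e^1)^\top .$$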
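It thus follows that: 

$$\begin{aligned}
v^\top H v &= -\langle \alpha, v \rangle^2 - \alpha_1 v_1^2 + \sum_{i>1} \alpha_i v_i^2 - 2 v_1 \sum_{i>1} \alpha_i v_i \\
&= \sum_{i>1} \alpha_i (v_i - v_1)^2 - \Big(\sum_{i>1} \alpha_i (v_i - v_1)\Big)^2 \\
&\le \|v\|_\infty^2 \le 2\|v\|_\infty^2 ,
\end{aligned}$$

where the equality uses $-\alpha_1 = \sum_{i>1} \alpha_i$, which also gives $\langle \alpha, v \rangle = \sum_{i>1} \alpha_i (v_i - v_1)$. The first inequality holds 
because the middle expression is the variance of a random variable that takes the value $v_i - v_1$ with probability $\alpha_i$ for 
$i > 1$ and the value 0 with the remaining probability $1 - \sum_{i>1} \alpha_i$. All of these values lie in an interval of length 
$2\|v\|_\infty$, so the variance is at most $\|v\|_\infty^2$. This concludes our proof. ■ 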
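The smoothness inequality of Lemma 1 is easy to probe numerically. The sketch below is an illustrative sanity check on random vectors, not part of the proof; the function names, the dimension $k = 6$, and the choice $j = 0$ (zero-based index of the correct class) are assumptions.

```python
import numpy as np

def ell(v, j=0):
    """log(1 + sum_{i != j} exp(1 - v[j] + v[i]))."""
    return float(np.log1p(np.exp(1.0 - v[j] + np.delete(v, j)).sum()))

def grad_ell(v, j=0):
    """Gradient of ell: alpha_i for i != j, and minus their sum at index j."""
    e = np.exp(1.0 - v[j] + v)
    e[j] = 0.0
    alpha = e / (1.0 + e.sum())
    alpha[j] = -alpha.sum()
    return alpha

rng = np.random.default_rng(0)
worst_gap = np.inf
for _ in range(1000):
    u, v = rng.normal(size=6), rng.normal(size=6)
    upper = ell(u) + grad_ell(u) @ v + np.max(np.abs(v)) ** 2
    worst_gap = min(worst_gap, upper - ell(u + v))    # should never be negative
```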

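The above lemma implies that $L$ is smooth in the following sense: 

Lemma 2 For any $W, U$ s.t. $U = u (e^r)^\top$ (that is, only the $r$'th column of $U$ is not zero) we have that 

$$L(W - U) \le L(W) - \langle \nabla L(W), U \rangle + \|u\|_\infty^2 .$$

Proof Recall that $L(W)$ is the average over $(x, y)$ of a function of the form $\ell(Wx)$, where $\ell$ is as defined in 
Lemma 1. Therefore, 

$$\begin{aligned}
\ell((W + U)x) &\le \ell(Wx) + \langle \nabla \ell(Wx), Ux \rangle + \|Ux\|_\infty^2 \\
&= \ell(Wx) + \langle \nabla \ell(Wx), Ux \rangle + |x_r|^2 \|u\|_\infty^2 \\
&\le \ell(Wx) + \langle \nabla \ell(Wx), Ux \rangle + \|u\|_\infty^2 ,
\end{aligned}$$

where the last inequality is because we assume that $\|x\|_\infty \le 1$ for all $x$. Averaging over the examples and applying 
the bound with $-U$ in place of $U$, the above implies that 

$$L(W - U) \le L(W) - \langle \nabla L(W), U \rangle + \|u\|_\infty^2 . \qquad (8)$$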



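Equipped with the smoothness property of $L$, we now turn to show that if the greedy algorithm has not yet 
identified all the features of $W^*$ then a single greedy iteration yields substantial progress. We use the notation 
$\mathrm{supp}(W)$ to denote the indices of columns of $W$ which are not all-zeros. 

Lemma 3 Let $F, \bar{F}$ be two subsets of $[d]$ such that $\bar{F} - F \neq \emptyset$ and let 

$$W = \operatorname*{argmin}_{V : \mathrm{supp}(V) = F} L(V) , \qquad W^* = \operatorname*{argmin}_{V : \mathrm{supp}(V) = \bar{F}} L(V) .$$

Then, if $L(W) > L(W^*)$ we have 

$$L(W) - \min_u L\big(W + u (e^j)^\top\big) \ge \frac{(L(W) - L(W^*))^2}{4 \left(\sum_{i \in \bar{F} - F} \|W^*_{\cdot,i}\|_\infty\right)^2} ,$$

where $j = \operatorname*{argmax}_i \|\nabla_i L(W)\|_1$. 

Proof To simplify notation, denote $F^c = \bar{F} - F$. Using Lemma 2 we know that for any $u$: 

$$L\big(W - u (e^j)^\top\big) \le L(W) - \langle \nabla L(W), u (e^j)^\top \rangle + \|u\|_\infty^2 .$$

In particular, the above holds for the vector $u = \tfrac{1}{2} \|\nabla_j L(W)\|_1 \,\mathrm{sgn}(\nabla_j L(W))$ and by rearranging we obtain 
that 

$$L(W) - L\big(W - u (e^j)^\top\big) \ge \langle \nabla L(W), u (e^j)^\top \rangle - \|u\|_\infty^2 = \tfrac{1}{4} \|\nabla_j L(W)\|_1^2 .$$

It therefore suffices to show that 

$$\tfrac{1}{4} \|\nabla_j L(W)\|_1^2 \ge \frac{(L(W) - L(W^*))^2}{4 \left(\sum_{i \in \bar{F} - F} \|W^*_{\cdot,i}\|_\infty\right)^2} .$$

Denote $s = \sum_{j \in F^c} \|W^*_{\cdot,j}\|_\infty$; then an equivalent inequality$^3$ is 

$$s \|\nabla_j L(W)\|_1 \ge L(W) - L(W^*) .$$

From the convexity of $L$, the right-hand side of the above is upper bounded by $\langle \nabla L(W), W - W^* \rangle$. Hence, it is 
left to show that 

$$s \|\nabla_j L(W)\|_1 \ge \langle \nabla L(W), W - W^* \rangle .$$

Since we assume that $W$ is optimal over $F$ we get that $\nabla_i L(W) = 0$ for all $i \in F$, hence $\langle \nabla L(W), W \rangle = 0$. 
Additionally, $W^*_{\cdot,i} = 0$ for $i \notin \bar{F}$. Therefore, 

$$\begin{aligned}
\langle \nabla L(W), W - W^* \rangle &= -\sum_{i \in F^c} \langle \nabla_i L(W), W^*_{\cdot,i} \rangle \\
&\le \sum_{i \in F^c} \|\nabla_i L(W)\|_1 \, \|W^*_{\cdot,i}\|_\infty \\
&\le s \max_i \|\nabla_i L(W)\|_1 \\
&= s \|\nabla_j L(W)\|_1 ,
\end{aligned}$$

and this concludes our proof. 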
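The step used in the proof of Lemma 3 can be illustrated numerically: pick the column $j$ with the largest $\ell_1$ gradient norm, move along $u = \tfrac{1}{2}\|\nabla_j L(W)\|_1 \,\mathrm{sgn}(\nabla_j L(W))$, and compare the actual decrease of the loss with the guaranteed $\tfrac{1}{4}\|\nabla_j L(W)\|_1^2$. The sketch below is only an illustration of this analysis step on random data with $\|x\|_\infty \le 1$; it is not the full ShareBoost update, and all names and sizes are assumptions.

```python
import numpy as np

def loss_and_grad(W, X, y):
    """Average loss log(1 + sum_{y' != y} exp(1 - s_y + s_y')) and its gradient (k x d)."""
    m = X.shape[0]
    rows = np.arange(m)
    S = X @ W.T                                          # (m, k) class scores
    E = np.exp(1.0 - S[rows, y][:, None] + S)
    E[rows, y] = 0.0                                     # exclude the true class
    Z = 1.0 + E.sum(axis=1, keepdims=True)
    A = E / Z                                            # alpha_i for the wrong classes
    A[rows, y] = -A.sum(axis=1)                          # derivative w.r.t. true-class score
    return float(np.log(Z).mean()), A.T @ X / m

def greedy_step(W, X, y):
    """Select the column with the largest l1 gradient norm and take the step from the proof."""
    loss, G = loss_and_grad(W, X, y)
    j = int(np.argmax(np.abs(G).sum(axis=0)))
    g = G[:, j]
    u = 0.5 * np.abs(g).sum() * np.sign(g)
    W_new = W.copy()
    W_new[:, j] -= u                                     # W - u (e^j)^T
    return W_new, j, loss, 0.25 * np.abs(g).sum() ** 2

rng = np.random.default_rng(0)
m, d, k = 200, 30, 5
X = rng.uniform(-1.0, 1.0, size=(m, d))                  # guarantees ||x||_inf <= 1
y = rng.integers(0, k, size=m)
W = np.zeros((k, d))
W_new, j, loss_before, guaranteed_decrease = greedy_step(W, X, y)
loss_after, _ = loss_and_grad(W_new, X, y)
```

The observed decrease `loss_before - loss_after` is never smaller than `guaranteed_decrease`; minimizing over $u$ exactly, as the lemma allows, can only do better.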



Using the above lemma, the proof of our main theorem easily follows. 

Proof [of Theorem 2] Denote $\epsilon_t = L(W^{(t)}) - L(W^*)$, where $W^{(t)}$ is the value of $W$ at iteration $t$. The definition 
of the update implies that $L(W^{(t+1)}) \le \min_{i,u} L\big(W^{(t)} + u (e^i)^\top\big)$. The conditions of Lemma 3 hold and therefore 
we obtain that (with $F = F^{(t)}$) 

$$\epsilon_t - \epsilon_{t+1} = L(W^{(t)}) - L(W^{(t+1)}) \ge \frac{\epsilon_t^2}{4 \left(\sum_{i \in \bar{F} - F} \|W^*_{\cdot,i}\|_\infty\right)^2} \ge \frac{\epsilon_t^2}{4 \|W^*\|_{\infty,1}^2} . \qquad (9)$$

Using Lemma B.2 from (Shalev-Shwartz et al., 2010), the above implies that for $t \ge 4 \|W^*\|_{\infty,1}^2 / \epsilon$ we have that 
$\epsilon_t \le \epsilon$, which concludes our proof. ■ 
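The last step of the proof turns the per-iteration decrease $\epsilon_t - \epsilon_{t+1} \ge \epsilon_t^2 / (4\|W^*\|_{\infty,1}^2)$ into an iteration count. The sketch below simulates the slowest decrease this recursion permits (equality at every step) and checks the gap after $4\|W^*\|_{\infty,1}^2/\epsilon$ iterations. The numbers chosen for the norm, the target accuracy and the initial gap are arbitrary illustrative assumptions.

```python
def worst_case_gaps(eps0, B, steps):
    """Iterate eps_{t+1} = eps_t - eps_t^2 / (4 B^2), the slowest decrease the recursion allows."""
    gaps = [eps0]
    for _ in range(steps):
        gaps.append(gaps[-1] - gaps[-1] ** 2 / (4.0 * B ** 2))
    return gaps

B, eps = 3.0, 0.05                 # B stands in for ||W*||_{inf,1}; eps is the target gap
T = int(4 * B ** 2 / eps)          # iteration count of the form 4 B^2 / eps
gaps = worst_case_gaps(eps0=2.0, B=B, steps=T)
final_gap = gaps[T]
```

Even in this worst case the gap decays like $4B^2/(t+1)$, which is where the $1/\epsilon$ dependence of the iteration bound comes from.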



3 This is indeed equivalent because the lemma assumes that L(W) > L(W*) 



